Allow postponing dataset integrity checks in NextGenHDFDataset
#1323
base: master
Force-pushed from 8936027 to a00d0ec.
Why do you put "chore:" into the description? Where do you get this from? This is inconsistent with our normal commit messages, so we should not use it.
Does this really need to be a new option? I personally prefer to have fewer options if possible, especially if they are not really needed. Can't we just always do those integrity checks lazily?
Please check failing tests.
Please use PyCharm to directly see those inspections.
I guess @JackTemaki and/or @patrick-wilken should otherwise review this.
Force-pushed from 2f08d76 to a8ecbbf.
I don't really have an opinion for or against that here -- it's mainly detecting data format errors early vs. saving time. If you're sure your data is good there is no reason to do these checks eagerly.
Follow-up from #1315, where @JackTemaki noticed that NextGenHDFDataset goes through the entire data on startup to perform integrity checks -- and indeed RETURNN startup w/ that dataset is quite slow. This PR allows delaying these checks to training time, saving startup time at the cost of detecting data errors only later on.
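To illustrate the trade-off being discussed, here is a minimal sketch of eager vs. postponed checks. It is not the actual RETURNN code: the class `LazyCheckedDataset`, the `postpone_checks` flag, and the specific check (all streams of a sequence have equal length) are hypothetical names and assumptions chosen for illustration only.

```python
from typing import Dict, List, Sequence, Set


class LazyCheckedDataset:
    """Hypothetical stand-in for a dataset; NOT RETURNN's NextGenHDFDataset."""

    def __init__(self, seqs: Sequence[Dict[str, List[int]]], postpone_checks: bool = False):
        self._seqs = seqs
        self._checked: Set[int] = set()
        if not postpone_checks:
            # Eager mode: walk over all data at startup (slow, but fails early).
            for idx in range(len(seqs)):
                self._check(idx)

    def _check(self, idx: int) -> None:
        """Assumed integrity check: all streams of one sequence have equal length."""
        if idx in self._checked:
            return  # already validated, nothing to do
        lengths = {len(stream) for stream in self._seqs[idx].values()}
        if len(lengths) != 1:
            raise ValueError(f"seq {idx}: inconsistent stream lengths {sorted(lengths)}")
        self._checked.add(idx)

    def get_seq(self, idx: int) -> Dict[str, List[int]]:
        # Postponed mode: the check runs on first access, i.e. at training time.
        self._check(idx)
        return self._seqs[idx]


good = {"data": [1, 2], "classes": [0, 1]}
bad = {"data": [1, 2, 3], "classes": [0]}

lazy = LazyCheckedDataset([good, bad], postpone_checks=True)  # constructs instantly
first = lazy.get_seq(0)  # fine; the bad sequence is only detected when it is accessed
```

With `postpone_checks=False`, the same input would already raise in the constructor, which is the "detect data format errors early" side of the trade-off.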